Matching Titles with Cross Title Web-Search Enrichment and Community Detection
Abstract
Title matching refers roughly to the following problem. We are given two strings of text obtained from different data sources. The texts refer to some underlying physical entities and the problem is to report whether the two strings refer to the same physical entity or not. There are manifestations of this problem in a variety of domains, such as product or bibliography matching, and location or person disambiguation. We propose a new approach to solving this problem, consisting of two main components. The first component uses Web searches to “enrich” the given pair of titles: making titles that refer to the same physical entity more similar, and those which do not, much less similar. A notion of similarity is then measured using the second component, where the tokens from the two titles are modelled as vertices of a “social” network graph. A “strength of ties” style of clustering algorithm is then applied on this to see whether they form one cohesive “community” (matching titles), or separately clustered communities (mismatching titles). Experimental results confirm the effectiveness of our approach over existing title matching methods across several input domains.
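The graph idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm, and it rests on several assumptions: the snippets are hard-coded stand-ins for Web-search results, edge weights are simple co-occurrence counts, and connected components over thresholded edges take the place of the paper's “strength of ties” clustering. All function names and the `min_weight` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return set(text.lower().split())


def build_graph(title_a, title_b, snippets):
    """Vertices are the tokens of both titles. An edge's weight counts the
    snippets (assumed stand-ins for search results) where both tokens occur."""
    vertices = tokens(title_a) | tokens(title_b)
    weight = defaultdict(int)
    for snippet in snippets:
        present = sorted(vertices & tokens(snippet))
        for u, v in combinations(present, 2):
            weight[(u, v)] += 1
    return vertices, weight


def communities(vertices, weight, min_weight=2):
    """Keep edges with weight >= min_weight and return connected components.
    This is an assumed simplification, not the paper's clustering method."""
    adjacency = {v: set() for v in vertices}
    for (u, v), w in weight.items():
        if w >= min_weight:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, groups = set(), []
    for start in vertices:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups


def titles_match(title_a, title_b, snippets, min_weight=2):
    """Report a match when all tokens fall into one cohesive community."""
    vertices, weight = build_graph(title_a, title_b, snippets)
    return len(communities(vertices, weight, min_weight)) == 1
```

Under these assumptions, two titles whose tokens repeatedly co-occur in the same snippets collapse into a single community, while titles whose tokens only co-occur within their own title stay in separate communities.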
Similar resources
Using Web Page Titles to Rediscover Lost Web Pages
Titles are denoted by the TITLE element within a web page. We queried the title against the Yahoo search engine to determine the page’s status (found, not found). We conducted several tests based on elements of the title. These tests were used to discern whether we could predict a page’s status based on the title. Our results increase our ability to determine bad titles but not our ability t...
Syntactic Structures and Rhetorical Functions of Electrical Engineering, Psychiatry, and Linguistics Research Article Titles in English and Persian: A Cross-linguistic and Cross-disciplinary Study
A research article (RA) title is the first and foremost feature that attracts the reader's attention, the feature from which she/he may decide whether the whole article is worth reading. The present study attempted to investigate syntactic structures and rhetorical functions of RA titles written in English and Persian and published in journals in three disciplines of Electrical Engineering, Psy...
A Hybrid Model Words-Driven Approach for Web Product Duplicate Detection
The detection of product duplicates is one of the challenges that Web shop aggregators are currently facing. In this paper, we focus on solving the problem of product duplicate detection on the Web. Our proposed method extends a state-of-the-art solution that uses the model words in product titles to find duplicate products. First, we employ the aforementioned algorithm in order to find matchin...
Effectiveness of Title-search vs. Full-text Search in the Web
Search engines sometimes apply the search on the full text of documents or web-pages; but sometimes they can apply the search on selected parts of the documents only, e.g. their titles. Full-text search may consume a lot of computing resources and time. It may be possible to save resources by applying the search on the titles of documents only, assuming that a title of a document provides a con...
Syntactic Structures in Research Article Titles from Three Different Disciplines: Applied Linguistics, Civil Engineering, and Dentistry
Deducing what a paper is about, titles are considered as the most important determinant of how many people will read the article. Therefore, studying the use of different syntactic structures and their rhetorical functions in titles is of great significance. The current study was set to investigate these structures used in research article titles in three disciplines of Applied Linguistics, Den...
Journal:
- PVLDB
Volume 7, Issue
Pages -
Publication date: 2014